Wine Quality Prediction

WINE!!!

Through this project we embark on an exhilarating journey through the world of wine, where we explore the delicate art of predicting wine quality. With each sip and swirl, we delve into the enchanting realm of data analysis and visualization, uncovering the hidden gems within the chemical attributes that define wine excellence.


LOADING THE DATASETS

library(ezids)  # assumed source of loadPkg() and the xkable* helpers used below
wine <- read.csv("winequalityN.csv")
xkabledplyhead(wine)
Head
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
white 7.0 0.27 0.36 20.7 0.045 45 170 1.001 3.00 0.45 8.8 6
white 6.3 0.30 0.34 1.6 0.049 14 132 0.994 3.30 0.49 9.5 6
white 8.1 0.28 0.40 6.9 0.050 30 97 0.995 3.26 0.44 10.1 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
white 7.2 0.23 0.32 8.5 0.058 47 186 0.996 3.19 0.40 9.9 6
xkabledplytail(wine)
Tail
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
6493 red 6.2 0.600 0.08 2.0 0.090 32 44 0.995 3.45 0.58 10.5 5
6494 red 5.9 0.550 0.10 2.2 0.062 39 51 0.995 3.52 NA 11.2 6
6495 red 6.3 0.510 0.13 2.3 0.076 29 40 0.996 3.42 0.75 11.0 6
6496 red 5.9 0.645 0.12 2.0 0.075 32 44 0.996 3.57 0.71 10.2 5
6497 red 6.0 0.310 0.47 3.6 0.067 18 42 0.996 3.39 0.66 11.0 6
#xkablesummary(wine)

We have successfully loaded the dataset, which has 6497 observations and 13 variables.

Next, let's look at the summary statistics.


SUMMARY STATISTICS

xkablesummary(wine)
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:6497 Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.22 Min. : 8.0 Min. :3.00
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.250 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 17.0 1st Qu.: 77 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.43 1st Qu.: 9.5 1st Qu.:5.00
Median Mode :character Median : 7.00 Median :0.29 Median :0.310 Median : 3.0 Median :0.047 Median : 29.0 Median :118 Median :0.995 Median :3.21 Median :0.51 Median :10.3 Median :6.00
Mean NA Mean : 7.22 Mean :0.34 Mean :0.319 Mean : 5.4 Mean :0.056 Mean : 30.5 Mean :116 Mean :0.995 Mean :3.22 Mean :0.53 Mean :10.5 Mean :5.82
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.40 3rd Qu.:0.390 3rd Qu.: 8.1 3rd Qu.:0.065 3rd Qu.: 41.0 3rd Qu.:156 3rd Qu.:0.997 3rd Qu.:3.32 3rd Qu.:0.60 3rd Qu.:11.3 3rd Qu.:6.00
Max NA Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289.0 Max. :440 Max. :1.039 Max. :4.01 Max. :2.00 Max. :14.9 Max. :9.00
NA NA NA’s :10 NA’s :8 NA’s :3 NA’s :2 NA’s :2 NA NA NA NA’s :9 NA’s :4 NA NA

A quick look at the summary shows the quartiles, mean, minimum, and maximum of each variable.

Observations:

  1. Several variables contain NA values, which we need to remove.

  2. There might be duplicates in the dataset.
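
Both observations can be checked directly before any cleaning. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the wine data) to show how missing values per column and duplicated rows might be counted in base R.

```r
# Hypothetical toy data frame standing in for the wine data
toy <- data.frame(
  type    = c("white", "white", "red", "red", "red"),
  pH      = c(3.00, 3.00, 3.45, NA, 3.39),
  alcohol = c(8.8, 8.8, 10.5, 11.2, 11.0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)
```

Applied to `wine`, the same two calls would give the NA counts per variable and the number of duplicate rows.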


CLEANING THE DATA

Removing Duplicates

wine <- unique(wine)
xkablesummary(wine)
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:5329 Min. : 3.80 Min. :0.08 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1.0 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.22 Min. : 8.0 Min. :3.0
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.240 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 16.0 1st Qu.: 75 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.43 1st Qu.: 9.5 1st Qu.:5.0
Median Mode :character Median : 7.00 Median :0.30 Median :0.310 Median : 2.7 Median :0.047 Median : 28.0 Median :116 Median :0.995 Median :3.21 Median :0.51 Median :10.4 Median :6.0
Mean NA Mean : 7.22 Mean :0.34 Mean :0.319 Mean : 5.1 Mean :0.057 Mean : 30.1 Mean :114 Mean :0.995 Mean :3.22 Mean :0.53 Mean :10.6 Mean :5.8
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.41 3rd Qu.:0.400 3rd Qu.: 7.5 3rd Qu.:0.066 3rd Qu.: 41.0 3rd Qu.:154 3rd Qu.:0.997 3rd Qu.:3.33 3rd Qu.:0.60 3rd Qu.:11.4 3rd Qu.:6.0
Max NA Max. :15.90 Max. :1.58 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289.0 Max. :440 Max. :1.039 Max. :4.01 Max. :2.00 Max. :14.9 Max. :9.0
NA NA NA’s :10 NA’s :8 NA’s :3 NA’s :2 NA’s :2 NA NA NA NA’s :9 NA’s :4 NA NA

Duplicate rows can bias later analysis by over-weighting repeated observations, so we have removed them.

Now, the dataset has 5329 observations.

Removing NA Values

wine <- na.omit(wine)
# str(wine)
xkablesummary(wine)
Table: Statistics summary.
type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
Min Length:5295 Min. : 3.80 Min. :0.080 Min. :0.000 Min. : 0.6 Min. :0.009 Min. : 1 Min. : 6 Min. :0.987 Min. :2.72 Min. :0.220 Min. : 8.0 Min. :3.0
Q1 Class :character 1st Qu.: 6.40 1st Qu.:0.230 1st Qu.:0.240 1st Qu.: 1.8 1st Qu.:0.038 1st Qu.: 16 1st Qu.: 74 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.430 1st Qu.: 9.5 1st Qu.:5.0
Median Mode :character Median : 7.00 Median :0.300 Median :0.310 Median : 2.7 Median :0.047 Median : 28 Median :116 Median :0.995 Median :3.21 Median :0.510 Median :10.4 Median :6.0
Mean NA Mean : 7.22 Mean :0.344 Mean :0.319 Mean : 5.1 Mean :0.057 Mean : 30 Mean :114 Mean :0.995 Mean :3.22 Mean :0.533 Mean :10.6 Mean :5.8
Q3 NA 3rd Qu.: 7.70 3rd Qu.:0.410 3rd Qu.:0.400 3rd Qu.: 7.5 3rd Qu.:0.066 3rd Qu.: 41 3rd Qu.:154 3rd Qu.:0.997 3rd Qu.:3.33 3rd Qu.:0.600 3rd Qu.:11.4 3rd Qu.:6.0
Max NA Max. :15.90 Max. :1.580 Max. :1.660 Max. :65.8 Max. :0.611 Max. :289 Max. :440 Max. :1.039 Max. :4.01 Max. :2.000 Max. :14.9 Max. :9.0

After removing the NAs, we are left with 5295 observations. With duplicates and missing values removed, the dataset is clean, and we can start our exploratory data analysis.

Observations:

  1. For several variables there is a large gap between the maximum and Q3, which suggests a considerable number of outliers.

Let’s visualize this using plots.
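
Before plotting, the gap between Q3 and the maximum can be turned into a concrete outlier count. A minimal sketch, assuming the common 1.5 × IQR rule (the same rule box plots use for their whiskers); the vector `x` is made up for illustration and is not taken from the wine data.

```r
# Hypothetical right-skewed sample standing in for a variable such as residual.sugar
x <- c(1.2, 1.6, 1.8, 2.0, 2.7, 3.0, 5.1, 7.5, 8.1, 65.8)

# Flag values more than 1.5 * IQR below Q1 or above Q3
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
print(outliers)
```

The same steps could be applied to any numeric column of `wine` to count how many points fall outside the whiskers.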


UNIVARIATE PLOTS

To understand which factors affect quality the most, let's first look at the individual variables in the dataset.

ggplot(data = wine, aes(x = quality)) +
  geom_bar(width = 0.8, color = 'black', fill = I('yellow')) +
  labs(
    title = "Overall Wine Quality",
    x = "Quality",
    y = "Data - Red & white wine"
  )

Observations:

  1. Wine quality shows a rather symmetrical distribution.

  2. Most wines have a quality score of 6.

  3. No wine achieved the highest score of 10 and the worst wines got a rating of 3.

Let's see how the other variables are distributed.

p1 <- ggplot(data = wine, aes(x = citric.acid)) +
  geom_bar(fill = I('blue')) +
  labs(
    title = "Citric Acidity",
    x = "Concentration [g/L]",
    y = "Data"
  )

p2 <- ggplot(data = wine, aes(x = pH)) +
  geom_bar( fill = I('blue')) +
  labs(
    title = "pH",
    x = "pH",
    y = "Data"
  )
p3 <- ggplot(data = wine, aes(x = residual.sugar)) +
  geom_histogram(binwidth = 1,  fill = I('blue')) +
  labs(
    title = "Residual Sugar",
    x = "Residual Sugar (g/L)",
    y = "Data"
  )
p4 <- ggplot(data = wine, aes(x = density)) +
  geom_histogram(binwidth = 0.002,  fill = I('blue')) +
  labs(
    title = "Density",
    x = "Density",
    y = "Data"
  )
p5 <- ggplot(data = wine, aes(x = chlorides)) +
  geom_histogram(binwidth = 0.005,  fill = I('blue')) +
  labs(
    title = "Chlorides",
    x = "Chloride Content (g/L)",
    y = "Data"
  )
p6 <-  ggplot(data = wine, aes(x = alcohol)) +
  geom_histogram(binwidth = 1,  fill = I('blue')) +
  labs(
    title = "Alcohol Content",
    x = "Alcohol Content (% by volume)",
    y = "Data"
  )
library(gridExtra)  # provides grid.arrange()
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)

p7 <- ggplot(data = wine, aes(x = fixed.acidity)) +
  geom_bar( fill = I('blue')) +
  labs(
    title = "Fixed Acidity",
    x = "TaOH Concentration [g/L]",
    y = "Data"
  )

 p8 <- ggplot(data = wine, aes(x = volatile.acidity)) +
  geom_bar(  fill = I('blue')) +
  labs(
    title = "Volatile Acidity",
    x = "AcOH Concentration [g/L]",
    y = "Data"
  )
 
 p9 <- ggplot(wine, aes(x = free.sulfur.dioxide)) +
  geom_histogram(binwidth = 5,  fill = I('blue')) +
  labs(
    title = "Free Sulfur Dioxide Concentration",
    x = "Concentration (mg/L)",
    y = "Data"
  )

 p10 <- ggplot(wine, aes(x = total.sulfur.dioxide)) +
  geom_histogram(binwidth = 20, fill = "blue") +
  labs(
    title = "Total Sulfur Dioxide Concentration",
    x = "Concentration (mg/L)",
    y = "Data"
  )

grid.arrange(p7,p8, p9,p10, nrow = 2)

Observations:

  1. Most of the distributions look unremarkable. In general, they are positively skewed with a narrow main peak.
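
Skewness can also be quantified rather than judged by eye. The sketch below defines a hypothetical helper `skewness()` (base R has no built-in for this) using the third standardized moment, and applies it to simulated samples rather than to the wine data.

```r
# Sample skewness (third standardized moment);
# positive values indicate a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
right_skewed <- rexp(1000)   # hypothetical sample with a long right tail
symmetric    <- rnorm(1000)  # hypothetical roughly symmetric sample

print(skewness(right_skewed))
print(skewness(symmetric))
```

A clearly positive value for a wine variable would support the "positively skewed" reading of its histogram.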

We will also look at the box plots for some of these variables.

p1 <- ggplot(data = wine, aes(x = "", y = fixed.acidity )) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Fixed Acidity",
    y = "TaOH Concentration [g/L]"
  )

p2 <- ggplot(data = wine, aes(x = "", y = volatile.acidity)) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Volatile Acidity",
    y = "AcOH Concentration [g/L]"
  )

 p3 <- ggplot(data = wine, aes(x = "", y = citric.acid)) +
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "Citric Acidity",
    y = "Concentration [g/L]"
  )

 p4 <- ggplot(data = wine, aes(x = "", y = pH))+
  geom_boxplot(color = 'black', fill = I('white')) +
  labs(
    x = "pH",
    y = "pH"
  )
grid.arrange(p1,p2, p3,p4, nrow = 1)

Observations:

  1. Residual sugar has a very long-tail distribution with many outliers. It will be interesting to see how these outliers affect the quality of wine.

  2. Chlorides have distribution similar to residual sugar and have a strong concentration around the median. We also note a lot of outliers from the box plot.

  3. Most wines have less than 11% alcohol.

  4. Density has a very normal looking distribution with most of the values falling between 0.995 and 1.


SUMMARY OF UNIVARIATE PLOTS

  1. In general, the variables were positively skewed with a narrow main peak.

  2. Most wines have a pH of about 3.2. Since the wines contain citric acid and fixed and volatile acids, they were bound to be on the acidic side.

  3. The wines have an alcohol content ranging between 8 and 15 vol%.

Now let us see how the different factors are related to quality.


CORRELATION MATRIX

First, we build a correlation matrix to identify the variables that influence quality the most.

numeric_data <- subset(wine, select = -c(type))
redd <- subset(wine, type == "red")
numeric_datared <- subset(redd, select = -c(type))
whited <- subset(wine, type == "white")
numeric_datawhite <- subset(whited, select = -c(type))
loadPkg("corrplot")
cor_matrix <-cor(numeric_data)
corrplot(cor_matrix, method="circle",type="upper")


Observations:

Our Target Variable is Quality, so we will focus on only those parameters which influence quality.

  1. Alcohol and quality have a high positive correlation.

  2. Density and quality have a negative correlation.

  3. Quality and chloride concentration have a low correlation.

  4. Quality has a slight negative correlation with volatile acidity.

  5. Quality has a slight positive correlation with citric acid.
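
Reading correlations off a plot can be backed up by ranking them numerically. A minimal sketch, using the built-in `mtcars` data as a stand-in with `mpg` playing the role of the target; for the wine data one would presumably take the `"quality"` column of `cor_matrix` instead.

```r
# Stand-in data: built-in mtcars, with mpg as the target variable
cor_matrix <- cor(mtcars)

# Correlation of every other variable with the target
target_cor <- cor_matrix[, "mpg"]
target_cor <- target_cor[names(target_cor) != "mpg"]

# Rank by absolute strength, strongest first
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative correlations (such as density versus quality in the text above) near the top of the list.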

From the correlation matrix, let us now see how these parameters vary over different quality ratings for red and white wine separately and find out how they are different.


SUBSETTING DATA INTO RED AND WHITE WINES

# Subset the dataset into red and white wines
red <- wine[wine$type == "red", ]
white <- wine[wine$type == "white", ]

We subset the data into red and white wines so that each type can be analyzed separately.


BIVARIATE PLOTS

The correlation plot suggests that alcohol content, density, citric acid, and chlorides are among the variables most related to quality. Let us see how each of them relates to quality, and compare red and white wines.

1. ALCOHOL VS QUALITY

library(ggplot2)
ggplot(wine, aes(x = as.factor(quality), y = alcohol)) +
  geom_boxplot(fill = "brown", color = "darkblue") +
  labs(x = "Wine Quality", y = "Alcohol") +
  ggtitle("Box Plot of Alcohol vs Quality")

The boxplot shows that wines with higher quality seem to have a higher alcohol content.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in alcohol content.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines.

t_test_alcohol <- t.test(red$alcohol, white$alcohol, conf.level = 0.95)

p <- t_test_alcohol$p.value

Observations:

  1. Since the p-value of 3.606 × 10^{-6} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol concentration between the two wines.
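
The mechanics of the test can be tried on simulated data. A minimal sketch, assuming two made-up samples (`group_a` and `group_b` are hypothetical, not the red and white wines): `t.test()` runs Welch's test by default, and the confidence level is set with `conf.level`.

```r
set.seed(42)
# Hypothetical samples with different means and spreads
group_a <- rnorm(200, mean = 10.4, sd = 1.0)
group_b <- rnorm(300, mean = 10.6, sd = 1.2)

# Welch's test is the default (var.equal = FALSE)
result <- t.test(group_a, group_b, conf.level = 0.95)

p_value <- result$p.value
# Scientific notation keeps very small p-values readable
print(format(p_value, scientific = TRUE, digits = 4))
# 95% confidence interval for the difference in means
print(result$conf.int)
```

Printing `result$conf.int` alongside the p-value also shows how large the difference in means is, not only whether it is significant.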

Let us check how it varies for red and white wine.

library(ggplot2)
red <- subset(wine, type == "red")
white <- subset(wine, type == "white")
ggplot() +
  
  geom_boxplot(data = red, aes(x = as.factor(quality), y = alcohol, fill = "red"), width = 0.4) +
  labs(x = "Wine Quality", y = "Alcohol", fill = "Wine Color") +
  ggtitle("Box Plot of Alcohol vs Quality for Red Wines") +
  scale_fill_manual(values = c("red" = "red"))

ggplot() +
  geom_boxplot(data = white, aes(x = as.factor(quality), y = alcohol, fill = "white"), width = 0.4) +
  labs(x = "Wine Quality", y = "Alcohol", fill = "Wine Color") +
  ggtitle("Box Plot of Alcohol vs Quality for White Wines") +
  scale_fill_manual(values = c("white" = "white"))

Observations:

  1. White Wines have higher alcohol content.

  2. Alcohol has a strong positive correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in red wine

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in red wine

anova_result <- aov(alcohol ~ as.factor(quality), data = red)
print(summary(anova_result))
                     Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality)    5    438    87.6     103 <2e-16 ***
Residuals          1347   1144     0.8
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)
# summary() on a TukeyHSD object shows only its structure;
# print tukey_test itself to see the pairwise comparisons
summary(tukey_test)
                   Length Class  Mode
as.factor(quality) 60     -none- numeric

Observations:

  1. The ANOVA p-value is far below 0.05, so we reject the null hypothesis and conclude that mean alcohol content differs between at least some quality categories of red wine.

  2. We also ran a Tukey test to check between which quality ratings the alcohol level differs significantly.
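
The significant pairs can be pulled out of a `TukeyHSD` result programmatically instead of being read off the printed table. A minimal sketch on the built-in `PlantGrowth` data (a stand-in, not the wine data); there the factor is called `group`, whereas for the models in this report the list element would be named `as.factor(quality)`.

```r
# Stand-in data: built-in PlantGrowth (plant weight by treatment group)
fit <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# The result holds one matrix per factor with columns diff, lwr, upr, p adj
pairs <- tukey$group

# Keep only comparisons whose adjusted p-value is below 0.05
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
print(significant)
```

Each remaining row names a pair of groups whose means differ at the 5% family-wise level.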

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in white wine

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in white wine

anova_result <- aov(alcohol ~  as.factor(quality), data = white)
print(summary(anova_result))
                     Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality)    6   1467   244.5     220 <2e-16 ***
Residuals          3935   4377     1.1
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)
tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = alcohol ~ as.factor(quality), data = white)

$`as.factor(quality)`
      diff     lwr     upr p adj
4-3 -0.147 -0.8867  0.5934 0.997
5-3 -0.479 -1.1809  0.2223 0.405
6-3  0.304 -0.3957  1.0035 0.861
7-3  1.175  0.4692  1.8808 0.000
8-3  1.535  0.7882  2.2821 0.000
9-3  1.835  0.2794  3.3906 0.009
5-4 -0.333 -0.6009 -0.0644 0.005
6-4  0.451  0.1876  0.7134 0.000
7-4  1.322  1.0427  1.6006 0.000
8-4  1.682  1.3109  2.0527 0.000
9-4  1.982  0.5675  3.3957 0.001
6-5  0.783  0.6660  0.9003 0.000
7-5  1.654  1.5046  1.8040 0.000
8-5  2.014  1.7278  2.3011 0.000
9-5  2.314  0.9199  3.7087 0.000
7-6  0.871  0.7313  1.0111 0.000
8-6  1.231  0.9496  1.5130 0.000
9-6  1.531  0.1378  2.9245 0.020
8-7  0.360  0.0634  0.6568 0.006
9-7  0.660 -0.7365  2.0564 0.805
9-8  0.300 -1.1179  1.7176 0.996

Observations:

  1. The ANOVA p-value is far below 0.05, so we reject the null hypothesis and conclude that mean alcohol content differs between at least some quality categories of white wine.

  2. We also ran a Tukey test to check between which quality ratings the alcohol level differs significantly.


2. DENSITY VS QUALITY

library(ggplot2)
# Density Distribution by Wine Quality
ggplot(wine, aes(x = factor(quality), y = density)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Quality", y = "Density") +
  ggtitle("Density Distribution by Wine Quality")+
  ylim(0, 1.01)  # Adjust the limits as needed

The boxplot shows that higher-quality wines tend to be less dense.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in density level.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines in density level.

t_test_density <- t.test(red$density, white$density, conf.level = 0.95)

p <- t_test_density$p.value
print(p)

[1] 6.92e-322

Observations:

  1. Since the p-value of 6.92 × 10^{-322} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean density between the two wines.

Let us check how it varies for red and white wine.

# Box Plot of Density vs Quality for Red Wines
ggplot() +
  geom_boxplot(data = red, aes(x = as.factor(quality), y = density, fill = "light coral"), width = 0.4) +
  labs(x = "Wine Quality", y = "Density", fill = "light coral") +
  ggtitle("Box Plot of Density vs Quality for Red Wines") +
  scale_fill_manual(values = c("light coral" = "light coral"))

# Box Plot of Density vs Quality for White Wines
ggplot() +
  geom_boxplot(data = white, aes(x = as.factor(quality), y = density, fill = "red"), width = 0.4) +
  labs(x = "Wine Quality", y = "Density", fill = "red") +
  ggtitle("Box Plot of Density vs Quality for White Wines") +
  scale_fill_manual(values = c("red" = "red"))

Observations:

  1. Red Wines are more dense than white wines.

  2. Density has a negative correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in red wine.

anova_result <- aov(density ~ as.factor(quality), data = red)
print(summary(anova_result))
                     Df  Sum Sq  Mean Sq F value Pr(>F)
as.factor(quality)    5 0.00021 4.27e-05    12.8  4e-12 ***
Residuals          1347 0.00451 3.40e-06
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)
tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = density ~ as.factor(quality), data = red)

$`as.factor(quality)`
         diff       lwr       upr p adj
4-3 -0.000926 -0.002730  8.77e-04 0.686
5-3 -0.000380 -0.002047  1.29e-03 0.987
6-3 -0.000886 -0.002553  7.82e-04 0.654
7-3 -0.001413 -0.003114  2.88e-04 0.167
8-3 -0.002369 -0.004451 -2.87e-04 0.015
5-4  0.000546 -0.000210  1.30e-03 0.309
6-4  0.000041 -0.000718  8.00e-04 1.000
7-4 -0.000486 -0.001316  3.43e-04 0.550
8-4 -0.001442 -0.002902  1.73e-05 0.055
6-5 -0.000505 -0.000819 -1.91e-04 0.000
7-5 -0.001033 -0.001492 -5.73e-04 0.000
8-5 -0.001988 -0.003274 -7.03e-04 0.000
7-6 -0.000527 -0.000991 -6.39e-05 0.015
8-6 -0.001483 -0.002770 -1.96e-04 0.013
8-7 -0.000956 -0.002286  3.74e-04 0.314

Observations:

  1. The ANOVA p-value is far below 0.05, so we reject the null hypothesis and conclude that mean density differs between at least some quality categories of red wine.

  2. We also ran a Tukey test to check between which quality ratings the density differs significantly.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in white wine.

anova_result <- aov(density ~  as.factor(quality), data = white)
print(summary(anova_result))
                     Df  Sum Sq  Mean Sq F value Pr(>F)
as.factor(quality)    6 0.00458 0.000764     105 <2e-16 ***
Residuals          3935 0.02872 0.000007
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)

tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = density ~ as.factor(quality), data = white)

$`as.factor(quality)`
         diff      lwr       upr p adj
4-3 -0.000695 -0.00259  1.20e-03 0.934
5-3  0.000182 -0.00161  1.98e-03 1.000
6-3 -0.001160 -0.00295  6.33e-04 0.475
7-3 -0.002824 -0.00463 -1.02e-03 0.000
8-3 -0.003139 -0.00505 -1.23e-03 0.000
9-3 -0.003424 -0.00741  5.61e-04 0.147
5-4  0.000877  0.00019  1.56e-03 0.003
6-4 -0.000465 -0.00114  2.09e-04 0.392
7-4 -0.002129 -0.00284 -1.41e-03 0.000
8-4 -0.002444 -0.00339 -1.49e-03 0.000
9-4 -0.002729 -0.00635  8.93e-04 0.283
6-5 -0.001342 -0.00164 -1.04e-03 0.000
7-5 -0.003006 -0.00339 -2.62e-03 0.000
8-5 -0.003321 -0.00406 -2.59e-03 0.000
9-5 -0.003606 -0.00718 -3.44e-05 0.046
7-6 -0.001664 -0.00202 -1.31e-03 0.000
8-6 -0.001979 -0.00270 -1.26e-03 0.000
9-6 -0.002264 -0.00583  1.30e-03 0.500
8-7 -0.000315 -0.00108  4.45e-04 0.885
9-7 -0.000600 -0.00418  2.98e-03 0.999
9-8 -0.000285 -0.00392  3.35e-03 1.000

Observations:

  1. The ANOVA p-value is far below 0.05, so we reject the null hypothesis and conclude that mean density differs between at least some quality categories of white wine.

  2. We also ran a Tukey test to check between which quality ratings the density differs significantly.


3. CHLORIDES VS QUALITY

ggplot(wine, aes(x = factor(quality), y = chlorides)) +
  geom_boxplot(fill = "lightcoral") +
  labs(x = "Quality", y = "Chlorides") +
  ggtitle("Chloride Distribution by Wine Quality") +
  ylim(0, 0.2)

The boxplot shows that higher-quality wines tend to have lower chloride levels.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in chloride concentration.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines in chloride concentration.

t_test_chloride <- t.test(red$chlorides, white$chlorides, conf.level = 0.95)

p <- t_test_chloride$p.value
print(p)

[1] 3.71e-159

Observations:

  1. Since the p-value of 3.706 × 10^{-159} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration between the two wines.

Let us check how it varies for red and white wine.

ggplot(red, aes(x = factor(quality), y = chlorides)) +
  geom_boxplot(fill = "yellow") +
  labs(x = "Quality", y = "Chlorides") +
  ggtitle("Chloride Distribution by Wine Quality (Red Wine)") +
  ylim(0, 0.3)

ggplot(white, aes(x = factor(quality), y = chlorides)) +
  geom_boxplot(fill = "coral") +
  labs(x = "Quality", y = "Chlorides") +
  ggtitle("Chloride Distribution by White Wine Quality") +
  ylim(0, 0.2)

Observations:

  1. Red wines have a higher chloride concentration than white wines.

  2. Chloride concentration has a slight negative correlation with quality.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride concentration across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride concentration across quality categories in red wine.

anova_result <- aov(chlorides ~ as.factor(quality), data = red)
print(summary(anova_result))
                     Df Sum Sq Mean Sq F value  Pr(>F)
as.factor(quality)    5   0.06 0.01286    5.34 7.2e-05 ***
Residuals          1347   3.24 0.00241
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)
tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = chlorides ~ as.factor(quality), data = red)

$`as.factor(quality)`
        diff     lwr       upr p adj
4-3 -0.03244 -0.0808  1.59e-02 0.393
5-3 -0.02851 -0.0732  1.62e-02 0.452
6-3 -0.03730 -0.0820  7.40e-03 0.164
7-3 -0.04567 -0.0913 -7.67e-05 0.049
8-3 -0.05415 -0.1100  1.66e-03 0.063
5-4  0.00394 -0.0163  2.42e-02 0.994
6-4 -0.00485 -0.0252  1.55e-02 0.984
7-4 -0.01323 -0.0355  9.01e-03 0.534
8-4 -0.02170 -0.0608  1.74e-02 0.610
6-5 -0.00879 -0.0172 -3.65e-04 0.035
7-5 -0.01716 -0.0295 -4.85e-03 0.001
8-5 -0.02564 -0.0601  8.82e-03 0.276
7-6 -0.00837 -0.0208  4.05e-03 0.389
8-6 -0.01685 -0.0514  1.77e-02 0.731
8-7 -0.00848 -0.0441  2.72e-02 0.984

Observations:

  1. The ANOVA p-value is well below 0.05, so we reject the null hypothesis and conclude that mean chloride concentration differs between at least some quality categories of red wine.

  2. We also ran a Tukey test to check between which quality ratings the chloride concentration differs significantly.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean chloride concentration across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride concentration across quality categories in white wine.

anova_result <- aov(chlorides ~  as.factor(quality), data = white)
print(summary(anova_result))
                     Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality)    6  0.114  0.0190    37.7 <2e-16 ***
Residuals          3935  1.987  0.0005
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)

tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = chlorides ~ as.factor(quality), data = white)

$`as.factor(quality)`
         diff      lwr       upr p adj
4-3 -0.004162 -0.01993  0.011603 0.987
5-3 -0.001961 -0.01691  0.012985 1.000
6-3 -0.009171 -0.02407  0.005733 0.538
7-3 -0.016760 -0.03180 -0.001725 0.018
8-3 -0.017544 -0.03346 -0.001633 0.020
9-3 -0.026900 -0.06004  0.006239 0.201
5-4  0.002201 -0.00351  0.007915 0.917
6-4 -0.005009 -0.01061  0.000592 0.115
7-4 -0.012598 -0.01854 -0.006655 0.000
8-4 -0.013382 -0.02128 -0.005481 0.000
9-4 -0.022738 -0.05286  0.007386 0.281
6-5 -0.007210 -0.00970 -0.004714 0.000
7-5 -0.014799 -0.01799 -0.011609 0.000
8-5 -0.015583 -0.02169 -0.009476 0.000
9-5 -0.024939 -0.05464  0.004765 0.168
7-6 -0.007589 -0.01057 -0.004609 0.000
8-6 -0.008373 -0.01437 -0.002373 0.001
9-6 -0.017729 -0.04741  0.011953 0.574
8-7 -0.000784 -0.00710  0.005536 1.000
9-7 -0.010140 -0.03989  0.019609 0.953
9-8 -0.009356 -0.03956  0.020846 0.970

Observations:

  1. The ANOVA p-value is far below 0.05, so we reject the null hypothesis and conclude that mean chloride concentration differs between at least some quality categories of white wine.

  2. We also ran a Tukey test to check between which quality ratings the chloride concentration differs significantly.


4. CITRIC ACID VS QUALITY

ggplot(wine, aes(x = factor(quality), y = citric.acid)) +
  geom_boxplot(fill = "lightpink") +
  labs(x = "Quality", y = "Citric Acid") +
  ggtitle("Citric Acid Distribution by Wine Quality")+
  ylim(0.0, 0.15)  # note: ylim() drops values outside this range before the box statistics are computed

The boxplot shows that higher-quality wines tend to have more citric acid.

T-Test

WELCH TWO SAMPLE T-TEST

NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in citric acid concentration.

ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines in citric acid concentration.

t_test_citric <- t.test(red$citric.acid, white$citric.acid, conf.level = 0.95)

p <- t_test_citric$p.value
print(p)

[1] 1.33e-26

Observations:

  1. Since the p-value of 1.335 × 10^{-26} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid concentration between the two wines.

Let us check how it varies for red and white wine.

ggplot(red, aes(x = factor(quality), y = citric.acid)) +
  geom_boxplot(fill = "green") +
  labs(x = "Quality", y = "Citric Acid") +
  ggtitle("Citric Acid Distribution by Red Wine Quality")+
  ylim(0.0,0.15)

  ggplot(white, aes(x = factor(quality), y = citric.acid)) +
  geom_boxplot(fill = "maroon") +
  labs(x = "Quality", y = "Citric Acid") +
  ggtitle("Citric Acid Distribution by White Wine Quality")+
  ylim(0.0,0.15)

Observations:

  1. White wines have a higher citric acid concentration than red wines.

  2. Citric acid concentration has a slight positive correlation with quality.

  3. There isn’t much difference in citric acid concentration in white wines across the quality ratings.

ANOVA TEST FOR RED WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in red wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in red wine.

anova_result <- aov(citric.acid ~ as.factor(quality), data = red)
print(summary(anova_result))
                     Df Sum Sq Mean Sq F value  Pr(>F)
as.factor(quality)    5    2.9   0.584    16.1 1.8e-15 ***
Residuals          1347   48.8   0.036
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)
tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = citric.acid ~ as.factor(quality), data = red)

$`as.factor(quality)`
       diff      lwr    upr p adj
4-3 0.00477 -0.18279 0.1923 1.000
5-3 0.07377 -0.09948 0.2470 0.830
6-3 0.10949 -0.06389 0.2829 0.465
7-3 0.20086  0.02402 0.3777 0.015
8-3 0.21194 -0.00453 0.4284 0.059
5-4 0.06901 -0.00965 0.1477 0.124
6-4 0.10472  0.02579 0.1836 0.002
7-4 0.19609  0.10983 0.2823 0.000
8-4 0.20717  0.05542 0.3589 0.001
6-5 0.03572  0.00304 0.0684 0.023
7-5 0.12708  0.07934 0.1748 0.000
8-5 0.13817  0.00450 0.2718 0.038
7-6 0.09137  0.04318 0.1396 0.000
8-6 0.10245 -0.03138 0.2363 0.246
8-7 0.01108 -0.12720 0.1494 1.000

Observations:

  1. The ANOVA p-value is far below 0.05, so we reject the null hypothesis and conclude that mean citric acid level differs between at least some quality categories of red wine.

  2. We also ran a Tukey test to check between which quality ratings the citric acid level differs significantly.

ANOVA TEST FOR WHITE WINE

NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in white wine.

ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in white wine.

anova_result <- aov(citric.acid ~  as.factor(quality), data = white)
print(summary(anova_result))
                     Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality)    6    0.2  0.0363    2.43  0.024 *
Residuals          3935   58.8  0.0150
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

tukey_test <- TukeyHSD(anova_result)

tukey_test

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = citric.acid ~ as.factor(quality), data = white)

$`as.factor(quality)`
         diff       lwr     upr p adj
4-3 -0.031789 -0.117583 0.05400 0.930
5-3  0.000279 -0.081059 0.08162 1.000
6-3  0.002416 -0.078687 0.08352 1.000
7-3 -0.008569 -0.090389 0.07325 1.000
8-3 -0.000962 -0.087551 0.08563 1.000
9-3  0.050000 -0.130341 0.23034 0.983
5-4  0.032068  0.000969 0.06317 0.038
6-4  0.034205  0.003727 0.06468 0.016
7-4  0.023220 -0.009118 0.05556 0.342
8-4  0.030828 -0.012172 0.07383 0.344
9-4  0.081789 -0.082144 0.24572 0.762
6-5  0.002137 -0.011441 0.01572 0.999
7-5 -0.008848 -0.026203 0.00851 0.743
8-5 -0.001241 -0.034472 0.03199 1.000
9-5  0.049721 -0.111925 0.21137 0.972
7-6 -0.010985 -0.027202 0.00523 0.416
8-6 -0.003378 -0.036030 0.02927 1.000
9-6  0.047584 -0.113944 0.20911 0.977
8-7  0.007608 -0.026787 0.04200 0.995
9-7  0.058569 -0.103320 0.22046 0.938
9-8  0.050962 -0.113390 0.21531 0.970

Observations:

  1. The ANOVA p-value (0.024) is less than 0.05, so we reject the null hypothesis and conclude that mean citric acid level differs for at least one pair of quality categories in white wine.

  2. In the Tukey test, only the 5-4 and 6-4 comparisons are significant (adjusted p = 0.038 and 0.016); every other pair has an adjusted p-value above 0.05.
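
A significant p-value does not say how large the effect is. One common measure for a one-way ANOVA is eta squared, the between-group sum of squares divided by the total sum of squares. Using the rounded figures printed above, that is roughly 0.2 / (0.2 + 58.8), which is well under 1% of the variance in citric acid. The sketch below shows the calculation on a fitted model; `eta_squared` is a hypothetical helper name, and the demonstration uses R's built-in `PlantGrowth` data rather than the wine data.

```r
# Hypothetical helper (not part of the original analysis): eta squared for a
# one-way ANOVA, i.e. SS_between / (SS_between + SS_residual).
eta_squared <- function(fit) {
  sums_of_squares <- summary(fit)[[1]][["Sum Sq"]]
  sums_of_squares[1] / sum(sums_of_squares)
}

# Demonstration on built-in data (assumption: stands in for the wine data)
demo_fit <- aov(weight ~ group, data = PlantGrowth)
demo_eta <- eta_squared(demo_fit)
print(demo_eta)
```

Applied to this report, `eta_squared(anova_result)` would give the unrounded value for the white wine model.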

CORRELATION FOR THE BI-VARIATE PLOTS.

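library(corrplot)

# Keep the four features of interest and drop rows with missing values.
# New names are used so the full white and red data frames are not overwritten.
features <- c("citric.acid", "chlorides", "density", "alcohol")
white_features <- na.omit(white[, features])
red_features <- na.omit(red[, features])

correlation_matrix_white <- cor(white_features)
correlation_matrix_red <- cor(red_features)

# Unweighted element-wise average of the two correlation matrices
combined_correlation_matrix <- (correlation_matrix_white + correlation_matrix_red) / 2

corrplot(combined_correlation_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45, addCoef.col = "red")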

Observations:

The code above first removes rows with missing values in the four selected columns (citric acid, chlorides, density and alcohol) from both the white and red datasets, then computes a correlation matrix for each. The combined matrix is the element-wise average of the two, so both wine types count equally regardless of how many rows each has. It is plotted as an upper-triangle colour map with the coefficients overlaid in red.
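
Because the two datasets differ in size, a simple average gives each red wine row more influence than each white wine row. One possible variation, which is not what the code above does, is to weight each correlation matrix by its number of rows. The sketch below assumes a hypothetical helper `weighted_cor_average` and uses two subsets of R's built-in `iris` data as stand-ins for the two wine types.

```r
# Hypothetical helper (a variation, not the method used in this report):
# average two correlation matrices, weighting each by its row count.
weighted_cor_average <- function(x, y) {
  n_x <- nrow(x)
  n_y <- nrow(y)
  (n_x * cor(x) + n_y * cor(y)) / (n_x + n_y)
}

# Stand-in data (assumption): two groups of unequal size from iris
group_a <- iris[iris$Species == "setosa", 1:4]               # 50 rows
group_b <- iris[iris$Species == "versicolor", 1:4][1:20, ]   # 20 rows
demo_weighted <- weighted_cor_average(group_a, group_b)
print(round(demo_weighted, 2))
```

Either kind of average is only a rough summary. Plotting the two correlation matrices separately avoids the question of weighting altogether.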


SUMMARY OF BIVARIATE PLOTS

From the bivariate plots, we conclude that:

  1. Good-quality wines tend to have higher alcohol content.

  2. Quality improves as density decreases.

  3. Quality improves as chloride concentration decreases.

  4. Better wines tend to have a higher citric acid concentration.

  5. White wines generally have lower density and higher alcohol content.
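
One way to put numbers on statements such as the first four is a rank correlation between each feature and the quality score. The sketch below is an illustration only: `quality_correlations` is a hypothetical helper, and the demo input is just the ten rows printed by the head and tail calls at the top of this report, so its output shows the shape of the result and says nothing reliable about the full dataset.

```r
# Hypothetical helper (not part of the original analysis): Spearman rank
# correlation of each listed feature with quality.
quality_correlations <- function(data, features) {
  sapply(features, function(feature) {
    cor(data[[feature]], data$quality, method = "spearman", use = "complete.obs")
  })
}

# Demo input: the ten rows shown by head() and tail() earlier in the report
demo_wine <- data.frame(
  alcohol     = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.5, 11.2, 11.0, 10.2, 11.0),
  density     = c(1.001, 0.994, 0.995, 0.996, 0.996, 0.995, 0.995, 0.996, 0.996, 0.996),
  chlorides   = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.090, 0.062, 0.076, 0.075, 0.067),
  citric.acid = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.08, 0.10, 0.13, 0.12, 0.47),
  quality     = c(6, 6, 6, 6, 6, 5, 6, 6, 5, 6)
)
demo_correlations <- quality_correlations(
  demo_wine, c("alcohol", "density", "chlorides", "citric.acid")
)
print(round(demo_correlations, 2))
```

On the full data the same call would take `wine` (or `white` and `red` separately) in place of `demo_wine`.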


MULTIVARIATE PLOTS

For the last part of our EDA, we draw some multivariate plots to see how the remaining, less important features are distributed in red and white wine.

library(ggplot2)

ggplot(wine, aes(x = residual.sugar, y = density, color = factor(type))) +
  geom_point() +
  ggtitle("Scatter plot for Density vs Sugar") +
  labs(x = "Sugar", y = "Density", color = "Wine Type")

ggplot(wine, aes(x = alcohol, y = density, color = factor(type))) +
  geom_point() +
  ggtitle("Scatter plot for Density vs Alcohol") +
  labs(x = "Alcohol", y = "Density", color = "Wine Type")

ggplot(wine, aes(x = alcohol, y = chlorides, color = factor(type))) +
  geom_point() +
  ggtitle("Scatter plot for Chlorides vs Alcohol") +
  labs(x = "Alcohol", y = "Chlorides", color = "Wine Type")

ggplot(wine, aes(x = sulphates, y = residual.sugar, color = factor(type))) +
  geom_point() +
  ggtitle("Scatter plot for Sulphates vs Sugar") +
  labs(x = "Sulphates", y = "Residual Sugar", color = "Wine Type")

Observations:

  1. White wines have a higher residual sugar concentration than red wines, which may explain why white wines usually taste sweeter.

  2. Red wines have a higher sulphate concentration.
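
Readings taken from scatter plots like these can be checked with a per-type summary. The sketch below computes the median of each feature by wine type; `summarise_by_type` is a hypothetical helper, and the demo input is only the ten rows printed by the head and tail calls at the top of this report (including the one missing sulphates value), so it illustrates the mechanics and does not summarise the full dataset.

```r
# Hypothetical helper (not part of the original analysis): median of each
# listed feature per wine type, ignoring missing values.
summarise_by_type <- function(data, features) {
  aggregate(data[features], by = list(type = data$type),
            FUN = median, na.rm = TRUE)
}

# Demo input: the ten rows shown by head() and tail() earlier in the report
demo_wine <- data.frame(
  type           = rep(c("white", "red"), each = 5),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 2.0, 2.2, 2.3, 2.0, 3.6),
  sulphates      = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.58, NA, 0.75, 0.71, 0.66)
)
demo_summary <- summarise_by_type(demo_wine, c("residual.sugar", "sulphates"))
print(demo_summary)
```

The median is used here because residual sugar has a few very large values, such as the 20.7 in the first row, that would pull a mean upwards.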


LIMITATIONS

Although we are confident in our analysis, a few imbalances in the data may have influenced the results.

  1. There are considerably more observations for white wine than for red wine.

  2. Most observations fall in the middle of the quality range, i.e., ratings from 4 to 8.

A more balanced dataset would allow a better analysis.
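
Both imbalances can be quantified with a frequency table. The sketch below is an illustration: `class_shares` is a hypothetical helper, and the demo input is the ten rows printed by the head and tail calls at the top of this report, which contain five wines of each type by construction and so do not reflect the real imbalance.

```r
# Hypothetical helper (not part of the original analysis): share of
# observations in each category, rounded to three decimals.
class_shares <- function(x) {
  round(prop.table(table(x)), 3)
}

# Demo input: type and quality from the ten rows shown by head() and tail()
demo_type <- rep(c("white", "red"), each = 5)
demo_quality <- c(6, 6, 6, 6, 6, 5, 6, 6, 5, 6)

demo_type_shares <- class_shares(demo_type)
demo_quality_shares <- class_shares(demo_quality)
print(demo_type_shares)
print(demo_quality_shares)
```

On the full data, `class_shares(wine$type)` and `class_shares(wine$quality)` would show how uneven the wine types and quality ratings are.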


CONCLUSIONS FROM EDA

In our preliminary analysis, we have uncovered some key insights:

  1. Wines with elevated alcohol content, increased citric acid levels, lower density, and reduced chlorides tend to exhibit higher quality.

  2. It appears that white wines, in general, tend to be sweeter and have higher alcohol content when compared to their red counterparts.

With this newfound knowledge, you’re better equipped to make informed choices when selecting wines!!


MODELLING THE DATA

From our EDA, we have identified some key features that influence the target variable, quality.

As the next step of our analysis, we will fit several regression models to see which one predicts wine quality most accurately.


1. LINEAR REGRESSION


2. LOGISTIC REGRESSION